Extracting Chinese Frequent Strings Without a Dictionary From a Chinese Corpus and its Applications
Authors
Abstract
This paper describes how to extract Chinese frequent strings without using a dictionary. We generalize the notions of words and unknown words to that of frequent strings. The Chinese frequent strings (CFSs) we define include words, unknown words, and other strings that are frequently used. Examples of CFSs include strings glossed as “can only let”, “every minute and every second”, “bearing in mind the interest of each other”, and “and nobody”. CFSs are very useful in Chinese natural language processing and related applications. We show their application to three tasks: Chinese phoneme-to-character conversion, Chinese character-to-phoneme conversion, and the determination of prosodic segments in a Chinese sentence for text-to-speech output. We have also developed a simple method to extract CFSs from a corpus. The method automatically detects such strings without any lexicon, and no word segmentation is needed. It can also extract unknown words in a corpus that consist of three or more words. Such words usually cannot be extracted by a concatenation approach.
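The abstract does not spell out the extraction method, so the following is only a minimal sketch of the general idea of dictionary-free frequent-string extraction, not the authors' algorithm. It assumes a simple scheme: count every character n-gram in the raw text, keep those above a frequency threshold, and drop any string that occurs only inside a longer frequent string. The function names, thresholds, and toy corpus are hypothetical.

```python
from collections import Counter


def count_frequent_strings(sentences, min_len=2, max_len=6, min_count=2):
    """Count all character n-grams and keep those seen at least min_count times.

    No lexicon and no word segmentation are used: every substring of
    min_len..max_len characters is a candidate. (Illustrative assumption,
    not the method described in the paper.)
    """
    counts = Counter()
    for sentence in sentences:
        for n in range(min_len, max_len + 1):
            for i in range(len(sentence) - n + 1):
                counts[sentence[i:i + n]] += 1
    return {s: c for s, c in counts.items() if c >= min_count}


def drop_subsumed(frequent):
    """Drop a string when a longer frequent string containing it has the same count.

    Equal counts mean the shorter string never appears on its own, so it is
    treated as a fragment of the longer one.
    """
    kept = {}
    for s, c in frequent.items():
        subsumed = any(
            len(t) > len(s) and s in t and frequent[t] == c for t in frequent
        )
        if not subsumed:
            kept[s] = c
    return kept


# Hypothetical toy corpus, unsegmented.
corpus = ["自然語言處理很有趣", "我喜歡自然語言處理", "語言很有趣"]
result = drop_subsumed(count_frequent_strings(corpus))
```

On this toy corpus the six-character string 自然語言處理 survives as a single unit even though it spans several words, which is the kind of multi-word string that pairwise concatenation of shorter units can miss. A real corpus would need a higher threshold and a more careful filter.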
Similar resources
The Properties and Further Applications of Chinese Frequent Strings
This paper reveals some important properties of CFSs and applications in Chinese natural language processing (NLP). We have previously proposed a method for extracting Chinese frequent strings that contain unknown words from a Chinese corpus [Lin and Yu 2001]. We found that CFSs contain many 4-character strings, 3-word strings, and longer n-grams. Such information can only be derived from an ex...
Statistical Augmentation of a Chinese Machine-Readable Dictionary
We describe a method of using statistically-collected Chinese character groups from a corpus to augment a Chinese dictionary. The method is particularly useful for extracting domain-specific and regional words not readily available in machine-readable dictionaries. Output was evaluated both using human evaluators and against a previously available dictionary. We also evaluated performance improv...
Iterative Chinese Bi-gram Term Extraction Using Machine-learning Classification Approach
This paper presents an iterative approach to extracting Chinese terms. Unlike the traditional approach to extracting Chinese terms, which requires the assistance of a dictionary, the proposed approach exploits the Support Vector Machine classifier which learns the extraction rules from the occurrences of a single popular term in the corpus. Additionally, we have designed a very effective featur...
“Those Nation Wreckers are Suffering from Inferiority Complex”: The Depiction of Chinese Miners in the Ghanaian Press
This article studies the depiction of Chinese miners in the Ghanaian news website entitled Modern Ghana. A total of 87 articles comprising 43752 words were retrieved. Van Leeuwen’s (2008) theory of the representation of the social actors was utilised to examine the depiction of Chinese miners in the Ghanaian press. In this regard, six applicable tools were used and these include exclusion, role...
Exploiting Shared Chinese Characters in Chinese Word Segmentation Optimization for Chinese-Japanese Machine Translation
Unknown words and word segmentation granularity are two main problems in Chinese word segmentation for Chinese-Japanese Machine Translation (MT). In this paper, we propose an approach that exploits common Chinese characters shared between Chinese and Japanese in Chinese word segmentation optimization for MT, aiming to solve these problems. We augment the system dictionary of a Chinese segmenter b...